Very-Wide-Issue Superscalar Microengine Configurations

Authors

  • Steve Bennett
  • Bin Wang
Abstract

To continue the microprocessor performance improvements of the last two decades, instruction-level parallelism must be exploited across multiple basic block boundaries. This necessity has led to execution engines that dynamically predict a stream of instructions to be executed concurrently. As issue widths increase, former assumptions about requirements for execution resources such as internal buses, renaming structures, memory ports and bypass paths no longer hold. In this study, we modeled an aggressive superscalar implementation and studied the effects of limiting certain resources on overall performance, as measured by total execution cycles. We have three main conclusions: (1) bypass paths that let instructions issue directly from the fetch mechanism to the execution units are unjustified if instruction fetch prediction is very good; (2) the number of reservation stations should be made as large as possible after balancing all other resources; and (3) the number of result buses needed to avoid handicapping the rest of the execution engine grows proportionally to the issue rate, with a constant of proportionality of approximately 0.25.

1.0 Background and Motivation

Parallelism studies have shown that a great deal of instruction-level parallelism is present in both numerical and scalar codes [1][12]. To capitalize on this parallelism, an execution engine must decode and execute instructions across a wide window that spans many basic blocks. It is not clear exactly what form the hardware should take to realize performance improvements from this parallelism. Research begun in the 1980s [7][9][6] has led to many commercial superscalar processors. Recent superscalar microprocessor releases from Intel [2], AMD [8], DEC [4] and MIPS [5] display a wide variety of implementation techniques aimed at achieving high performance. They are based in part on exploiting instruction-level parallelism in the executing code.
These processors expose parallelism by dynamically predicting branches to form a wide window of instructions. The window produced by branch prediction may contain code from many non-consecutive basic blocks. From this dynamic instruction pool, multiple instructions are scheduled to execute each cycle. This requires a large hardware investment to create the window, to schedule and execute many instructions from it, and to recover from mispredictions and exceptions. Where should design effort and silicon be dedicated?

2.0 Project Details

2.1 Machine Model

We studied the execution resource requirements of a superscalar processor capable of issuing instructions from many (dynamically) sequential basic blocks in parallel. The machine configuration modeled is shown in Figure 3. The portions of the machine that we are concerned with are shown in bold.

Figure 3: Machine Configuration (blocks shown: Instruction Cache and Fetch/Prediction Logic, RUU Dispatch, RUU, Register File, Functional Units, Data Cache, Memory Read/Write, Bypass Path(s))

Parameters we investigate below are: ...
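Conclusion (3) of the abstract can be turned into a small worked calculation. The sketch below is only an illustration of that rule of thumb, not the authors' simulator: the function name `estimated_result_buses`, the rounding up to a whole bus, and the floor of one bus are assumptions introduced here. Only the constant 0.25 comes from the abstract.

```python
import math


def estimated_result_buses(issue_width: int, ratio: float = 0.25) -> int:
    """Rule-of-thumb result-bus count for a given issue width.

    ratio=0.25 is the approximate constant of proportionality quoted in
    the abstract. Rounding up and the minimum of one bus are assumptions
    made for this illustration.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    return max(1, math.ceil(ratio * issue_width))


if __name__ == "__main__":
    # Hypothetical issue widths chosen only to show the linear growth.
    for width in (4, 8, 16, 32):
        print(f"issue width {width:2d}: about {estimated_result_buses(width)} result bus(es)")
```

Under these assumptions, doubling the issue width doubles the estimated number of result buses, which is the linear scaling the abstract describes.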

Related Resources

Function Unit Clustering in Wide-Issue Superscalar Processors

As more function units are integrated into wide-issue superscalar processors and as cycle times decrease, result-forwarding delays will become worse relative to processor cycle time. Physical distance and capacitive effects of smaller geometry wires are the main reasons for this increase in delay. Thus, a full bypass network, able to forward results from any function unit to any other function un...

Simultaneous Multithreading: Maximizing On-Chip Parallelism - Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on

This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing ...

Multiple Branch Prediction for Wide-Issue Superscalar

Modern micro-architectures employ superscalar techniques to enhance system performance. Because superscalar microprocessors must fetch at least one instruction cache line at a time to support a high issue rate and a large amount of speculative execution, multiple branches are often encountered in one cycle. In practical implementations this would cause serious problems while ...

Evaluating a Multithreaded Superscalar Microprocessor versus a Multiprocessor Chip

This paper examines implementation techniques for future generations of microprocessors. While the wide superscalar approach, which issues 8 or more instructions per cycle from a single thread, fails to yield satisfying performance, its combination with techniques that utilize more coarse-grained parallelism is very promising. These techniques are multithreading and multiprocessing. Multi-th...

Distributed Modulo Scheduling

Wide-issue ILP machines can be built using the VLIW approach, as many of the hardware complexities found in superscalar processors can be transferred to the compiler. However, the scalability of VLIW architectures is still constrained by the size and number of ports of the register file required by a large number of functional units. Organizations composed of clusters of a few functional units a...

Publication date: 2007